Drop (leading) windows in MSM where all coefficients zero #479
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
NOTE: CI is broken, because I temporarily added a parameter
useZeroWindows
to the MSM procs.This implements the same optimization as done in:
privacy-scaling-explorations/halo2curves#168
for the single threaded MSM implementations.
Essentially, if all coefficients
a_i
are smaller than some valuex < p
(withp
the prime field order), their binary representations will have leading zero bits. In this case, we can skip processing the windows corresponding to these leading zeros in the bucket calculation since they would not contribute to the final result.After writing a benchmark comparing the old implementation using all windows vs. the optimization (and hence added a new argument that currently breaks the CI), I noticed that the performance improvements are pretty minor. In fact, even without this optimization our improvements are massive for inputs with only small coefficients. The gains beyond that of skipping windows only provides an additional marginal improvement.
Here is a plot showing the time for an MSM with a different number of points and number of set bits. Compared is the reference implementation (using
b
prefix for baseline) with the optimized implementation. Both split by either using all windows or only using non zero leading windows. What is actually quite interesting is that for the reference implementation the improvements generally are much larger, pushing it in line with the optimized implementation.(I ran the bench on my laptop with a i7-8750H)